How to use Weave with PII data
In this guide, you’ll learn how to use W&B Weave while ensuring your Personally Identifiable Information (PII) data remains private. The guide demonstrates the following methods to identify, redact, and anonymize PII data:
- Regular expressions to identify PII data and redact it.
- Microsoft’s Presidio, a Python-based data protection SDK. This tool provides redaction and replacement functionalities.
- Faker, a Python library to generate fake data, combined with Presidio to anonymize PII data.
The guide also shows how to use weave.op input/output logging customization and autopatch_settings to integrate PII redaction and anonymization into the workflow. For more information, see Customize logged inputs and outputs.
To get started, do the following:
- Review the Overview section.
- Complete the prerequisites.
- Review the available methods for identifying, redacting and anonymizing PII data.
- Apply the methods to Weave calls.
Overview
The following section provides an overview of input and output logging using weave.op, as well as best practices for working with PII data in Weave.
Customize input and output logging using weave.op
Weave Ops allow you to define input and output postprocessing functions. Using these functions, you can modify the data that is passed to your LLM call or logged to Weave.
In the following example, two postprocessing functions are defined and passed as arguments to weave.op().
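The original code cell is not reproduced here, so the following is a minimal sketch of that pattern. The function names, the hide_me argument, and the dictionary shape are illustrative assumptions; postprocess_inputs and postprocess_output are the weave.op arguments described above. The import is guarded only so that the sketch also runs where Weave is not installed.

```python
from typing import Any

try:
    import weave  # pip install weave
except ImportError:  # keeps the sketch runnable without Weave installed
    weave = None


def postprocess_inputs(inputs: dict[str, Any]) -> dict[str, Any]:
    """Drop the sensitive `hide_me` argument (hypothetical name)."""
    return {k: v for k, v in inputs.items() if k != "hide_me"}


def postprocess_output(output: dict[str, Any]) -> dict[str, Any]:
    """Mask the `secret` field of the output (hypothetical name)."""
    return {**output, "secret": "REDACTED"}


def some_llm_call(a: int, hide_me: str) -> dict[str, Any]:
    """Stand-in for a real LLM call (hypothetical)."""
    return {"x": a, "secret": hide_me}


if weave is not None:
    # Equivalent to decorating the function with
    # @weave.op(postprocess_inputs=..., postprocess_output=...)
    some_llm_call = weave.op(
        postprocess_inputs=postprocess_inputs,
        postprocess_output=postprocess_output,
    )(some_llm_call)
```

Keeping the two functions as plain dictionary transforms makes them easy to unit test separately from any tracing setup.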
Best practices for using Weave with PII data
Before using Weave with PII data, review the following best practices.
During testing
- Log anonymized data to check PII detection
- Track PII handling processes with Weave Traces
- Measure anonymization performance without exposing real PII
In production
- Never log raw PII
- Encrypt sensitive fields before logging
Encryption tips
- Use reversible encryption for data you need to decrypt later
- Apply one-way hashing for unique IDs you don’t need to reverse
- Consider specialized encryption for data you need to analyze while encrypted
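As one illustration of the one-way hashing tip, the sketch below derives a stable pseudonymous ID from a raw identifier using a keyed hash (HMAC-SHA256) from the Python standard library. The function name pseudonymize_id and the hard-coded demo key are assumptions for illustration; in practice the key would come from a secret store.

```python
import hashlib
import hmac


def pseudonymize_id(raw_id: str, key: bytes) -> str:
    """Return the hex HMAC-SHA256 digest of raw_id under the given key."""
    return hmac.new(key, raw_id.encode("utf-8"), hashlib.sha256).hexdigest()


# Demo key only (assumption); load a real key from a secret manager.
DEMO_KEY = b"replace-with-a-secret-key"

token = pseudonymize_id("user-12345", DEMO_KEY)
print(token)  # 64 hex characters; the same input and key always give the same token
```

Because the same input always maps to the same token, logged records can still be grouped by user without storing the raw identifier.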
Prerequisites
- First, install the required packages.
- Initialize your Weave project.
- Load the demo PII dataset, which contains 10 text blocks.
Redaction methods overview
Once you’ve completed the setup, you can detect and protect PII data by identifying and redacting it, and optionally anonymizing it, using the following methods:
- Regular expressions to identify PII data and redact it.
- Microsoft Presidio, a Python-based data protection SDK that provides redaction and replacement functionality.
- Faker, a Python library for generating fake data.
Method 1: Filter using regular expressions
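A minimal sketch of this method follows. The function name redact_with_regex matches the name used later in this guide, but the three patterns and the placeholder strings are simplified assumptions rather than the notebook's exact patterns; they target US-style phone numbers, email addresses, and Social Security numbers only.

```python
import re

# Simplified, illustrative patterns (assumptions); real-world PII formats vary.
PII_PATTERNS = {
    "<EMAIL>": re.compile(r"\b[\w.+-]+@[\w-]+(?:\.[\w-]+)+\b"),
    "<SSN>": re.compile(r"\b\d{3}-\d{2}-\d{4}\b"),
    "<PHONE>": re.compile(r"\b\d{3}[-.\s]\d{3}[-.\s]\d{4}\b"),
}


def redact_with_regex(text: str) -> str:
    """Replace every match of each pattern with its placeholder."""
    for placeholder, pattern in PII_PATTERNS.items():
        text = pattern.sub(placeholder, text)
    return text


sample = "Call 212-555-5555 or email raphael@example.com. SSN: 123-45-6789."
print(redact_with_regex(sample))
# Call <PHONE> or email <EMAIL>. SSN: <SSN>.
```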
Regular expressions (regex) are the simplest method to identify and redact PII data. Regex allows you to define patterns that can match various formats of sensitive information like phone numbers, email addresses, and social security numbers. Using regex, you can scan through large volumes of text and replace or redact information without the need for more complex NLP techniques.
Method 2: Redact using Microsoft Presidio
The next method involves complete removal of PII data using Microsoft Presidio. Presidio redacts PII and replaces it with a placeholder representing the PII type. For example, Presidio replaces Alex in "My name is Alex" with <PERSON>.
Presidio comes with built-in support for common entities. In the example below, we redact all entities that are a PHONE_NUMBER, PERSON, LOCATION, EMAIL_ADDRESS, or US_SSN. The Presidio process is encapsulated in a function.
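A minimal sketch of such a function is shown below. It assumes the presidio-analyzer and presidio-anonymizer packages and a spaCy English model are installed. The function name redact_with_presidio and the lazy, cached engine setup are illustrative choices, not necessarily the notebook's exact code.

```python
from functools import lru_cache

# Entity types named in the text above.
ENTITIES = ["PHONE_NUMBER", "PERSON", "LOCATION", "EMAIL_ADDRESS", "US_SSN"]


@lru_cache(maxsize=1)
def _get_engines():
    """Create the Presidio engines once, on first use (they are slow to build)."""
    from presidio_analyzer import AnalyzerEngine
    from presidio_anonymizer import AnonymizerEngine

    return AnalyzerEngine(), AnonymizerEngine()


def redact_with_presidio(text: str) -> str:
    """Replace detected entities with type placeholders such as <PERSON>."""
    analyzer, anonymizer = _get_engines()
    results = analyzer.analyze(text=text, entities=ENTITIES, language="en")
    # With no operators given, the anonymizer substitutes <ENTITY_TYPE>.
    return anonymizer.anonymize(text=text, analyzer_results=results).text
```

Which spans are detected depends on the underlying NLP model, so results on real data should be spot-checked rather than assumed complete.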
Method 3: Anonymize with replacement using Faker and Presidio
Instead of redacting text, you can anonymize it by using MS Presidio to swap PII like names and phone numbers with fake data generated using the Faker Python library. For example, suppose you have the following data:
"My name is Raphael and I like to fish. My phone number is 212-555-5555"
Once the data has been processed using Presidio and Faker, it might look like:
"My name is Katherine Dixon and I like to fish. My phone number is 667.431.7379"
To effectively use Presidio and Faker together, we must supply references to our custom operators. These operators will direct Presidio to the Faker functions responsible for swapping PII with fake data.
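The sketch below shows one way such custom operators might be wired up. It assumes the faker, presidio-analyzer, and presidio-anonymizer packages are installed. The entity-to-Faker mapping and the function name anonymize_with_faker are illustrative assumptions; the "custom" operator with a lambda parameter is Presidio's mechanism for user-supplied replacement functions.

```python
from functools import lru_cache

# Presidio entity type -> name of the Faker method that generates a replacement.
FAKER_METHODS = {
    "PERSON": "name",
    "PHONE_NUMBER": "phone_number",
    "EMAIL_ADDRESS": "email",
    "LOCATION": "city",
    "US_SSN": "ssn",
}


@lru_cache(maxsize=1)
def _get_pipeline():
    """Build the engines and custom operators once, on first use."""
    from faker import Faker
    from presidio_analyzer import AnalyzerEngine
    from presidio_anonymizer import AnonymizerEngine
    from presidio_anonymizer.entities import OperatorConfig

    fake = Faker()
    # Each custom operator ignores the matched text and returns fake data.
    operators = {
        entity: OperatorConfig(
            "custom", {"lambda": lambda _match, gen=getattr(fake, method): gen()}
        )
        for entity, method in FAKER_METHODS.items()
    }
    return AnalyzerEngine(), AnonymizerEngine(), operators


def anonymize_with_faker(text: str) -> str:
    """Swap detected PII for Faker-generated values of the same kind."""
    analyzer, anonymizer, operators = _get_pipeline()
    results = analyzer.analyze(
        text=text, entities=list(FAKER_METHODS), language="en"
    )
    return anonymizer.anonymize(
        text=text, analyzer_results=results, operators=operators
    ).text
```

Faker output is random by default, so repeated runs produce different replacements unless the generator is seeded.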
Method 4: Use autopatch_settings
You can use autopatch_settings to configure PII handling directly during initialization for one or more of the supported LLM integrations. The advantages of this method are:
- PII handling logic is centralized and scoped at initialization, reducing the need for scattered custom logic.
- PII processing workflows can be customized or disabled entirely for specific integrations.
To use autopatch_settings to configure PII handling, define postprocess_inputs and/or postprocess_output in op_settings for any one of the supported LLM integrations.
Apply the methods to Weave calls
In the following examples, we will integrate our PII redaction and anonymization methods into Weave Models and preview the results in Weave Traces.
First, we’ll create a Weave Model. A Weave Model is a combination of information like configuration settings, model weights, and code that defines how the model operates. In our model, we will include our predict function where the Anthropic API will be called. Anthropic’s Claude Sonnet is used to perform sentiment analysis while tracing LLM calls using Traces. Claude Sonnet will receive a block of text and output one of the following sentiment classifications: positive, negative, or neutral. Additionally, we will include our postprocessing functions to ensure that our PII data is redacted or anonymized before it is sent to the LLM.
Once you run this code, you will receive links to the Weave project page, as well as to the specific trace (LLM calls) you ran.
Regex method
In the simplest case, we can use regex to identify and redact PII data from the original text.
Presidio redaction method
Next, we will use Presidio to identify and redact PII data from the original text.
Faker and Presidio replacement method
In this example, we use Faker to generate anonymized replacement PII data and use Presidio to identify and replace the PII data in the original text.
autopatch_settings method
In the following example, we set postprocess_inputs for anthropic to the postprocess_inputs_regex() function at initialization. The postprocess_inputs_regex function applies the redact_with_regex method defined in Method 1: Filter using regular expressions. Now, redact_with_regex will be applied to all inputs to any anthropic models.
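A minimal sketch of this configuration follows. The phone-only pattern is a simplified stand-in for the regex helper, and the init_weave wrapper is a hypothetical convenience. The nesting of autopatch_settings (integration name, then op_settings, then postprocess_inputs) follows the description above; whether weave.init accepts this argument may depend on the installed Weave version.

```python
import re
from typing import Any

# Simplified stand-in for the regex helper (illustrative pattern only).
_PHONE = re.compile(r"\b\d{3}[-.\s]\d{3}[-.\s]\d{4}\b")


def redact_with_regex(text: str) -> str:
    return _PHONE.sub("<PHONE>", text)


def postprocess_inputs_regex(inputs: dict[str, Any]) -> dict[str, Any]:
    """Redact every top-level string input; leave other values untouched."""
    return {
        key: redact_with_regex(value) if isinstance(value, str) else value
        for key, value in inputs.items()
    }


# Integration name -> op_settings -> postprocessing hooks.
autopatch_settings = {
    "anthropic": {
        "op_settings": {"postprocess_inputs": postprocess_inputs_regex},
    }
}


def init_weave(project_name: str):
    """Initialize Weave with the settings above (needs `weave` and a W&B login)."""
    import weave

    return weave.init(project_name, autopatch_settings=autopatch_settings)
```

This sketch only touches top-level string values; nested inputs such as lists of chat messages would need a recursive walk.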
(Optional) Encrypt your data
